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Abstract 

We present a gradient-based algorithm for unconstrained minimization derived 
from iterated linear change of basis. The new method is equivalent to linear con- 
jugate gradient in the case of a quadratic objective function. In the case of exact 
line search it is a secant method. In practice, it performs comparably to BFGS and 
DFP and is sometimes more robust. 



1 Iterated linear change of basis 

We consider the problem of minimizing a differentiate function / : R n — > R with no 
constraints on the variables. We propose the following algorithm for this problem. We 
assume a starting point w is given. Let /o be identified with /. 

Algorithm 1 

[1] for k = 1,2,... 

[2] p fe :=-V/ fe _i(w fc _i); 

[3] a k := argmin{/ fe _i(w fc _i + ap k ) : a > 0}; 

[4] w fc := Wfc_i + a k p k ; 

[5] gfc := — V/fc_i(wfc); 

[6] Define l k : R" - R n by Z fc (x) = (/ + P fcg^/||pfc|| 2 )(x); 

[7] fk ■= fk-i ° h; 

[8] w fc := ^(Wfe); 

[9] end 

Lines [1]— [4] of this algorithm are the standard steepest descent computation. In the third 
line, an inexact line search may be used in place of exact minimization over a. In the sixth 
line, I is the n x n identity matrix. The seventh line indicates functional composition: a 
new objective function is formed as the composition of the old objective function and a 
linear change of variables. 
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The eighth line applies the inverse transformation to Wfc so as to enforce the relation- 
ship /fc_i(wfc) = /fc(wfc). The inverse transform is efficiently computed and applied using 
the Sherman-Morrison formula. Although the function value is invariant, the gradient 
value is not, so the algorithm is not equivalent to a sequence of steepest descent steps 
in the original coordinates. This algorithm is equivalent to the linear conjugate gradient 
algorithm in the case that / is a convex quadratic function and the line search is exact, 
as we shall see in Section [3j 

When the algorithm terminates, say at iteration N, the vector is a minimizer or 
approximate minimizer of f^ N '. Therefore, the linear transformations must be saved and 
applied to in order to recover a solution to the original problem. 

In certain special classes of problems, it may be feasible to implement the algorithm 
exactly as stated because the objective function may be accessible for updating. More 
commonly, however, the objective function is available only as a subroutine, in which 
case the algorithm must be restated in a way so that it keeps track of the linear updates 
itself. In particular, it must save the two vectors defining the linear transformation from 
all previous iterations. Then the chain rule is applied, which states that if gr(x) = /(/(x)), 
where I is a linear function, then Vg(x) = / T (V/(/(x))), where l T denotes the transposed 
linear function. This version of the algorithm is as follows. There is no longer a subscript 
on / since / is not explicitly updated in this version. The current iterate in this algorithm 
is denoted Xfc and must be initialized as xo, which is equal to wo in Algorithm 1. 

Algorithm 2 

[1] for k = 1,2, ... 

[2] Pfe :=-#_ 1 o...oZf(V/(x fc _ 1 )); 

[3] m fc := l x o • • • o Z fc _i(p fc ); 

[4] a k := argmin{/(x fc _i + am k ) : a > 0}; 

[5] x fc := x fc _i + a k m k ; 

[6] gfe :=-ZLi °"--<(W(x fe )); 

[7] Define l k : R™ - R" by fc(x) = (/ + p fc g£/||p fc || 2 )x; 

[8] end 

The fact that Algorithms 1 and 2 are equivalent is an easy induction. The variables 
Pfc, a k and gfc are identical between the two algorithms, as are the sequences of linear 
transformations l k . The remaining variables have the following relationships: ~x. k — l\o 
• - • o /fc(wfc) and Xfc = /i o • ■ ■ o l k _i(w k ). Note that some redundant computation in step 
[2] can be saved by observing that 

Pfc = /jfe_i(gfc-i)> (1) 

where gfc_i was computed in step [6] of the previous iteration. 

We conclude this section with a result concerning the invertibility of the linear trans- 
formations. 

Lemma 1. Assume f is C 1 and none of the iterates in Algorithm 1 is a stationary point. 
Suppose an exact line search is used in Algorithm 1. Then l k is invertible on every step. 
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Proof. It follows from standard theory of steepest descent that if an exact line search is 
used, then 

Plgk = 0. (2) 

(This is the first-order condition for the optimality of a for the differentiable function 
/(wfc_i + apt)). In this case, / + Pkgl ' I (PfePfc) is invertible since it follows from the 
Sherman- Morrison formula that / + uv T is invertible unless u T v = — 1. □ 

We remark that many kinds of inexact line searches will also yield the same result. 
The requirement for invertibility of 4 is that p^p^ 7^ — g^Pfc- Written in terms of the line 
search function 0(a) = f k (wk + apfc), this is the same as saying that (fi'(0) 7^ — 0'(afc). A 
line search will often enforce the condition ^'(ct) | < |0'(O)|. 

Steps [2]-[3] of Algorithm 2 may be written as = — i/&V/(xfc_i), where Hk = 
l\ • • • lk-ilk-i • • • It ■ Obviously, H k is positive semidefinite, and assuming the condition in 
the previous paragraph holds, it is positive definite. This means that Algorithm 2 always 
produces descent directions except in the unexpected case that p^p/c = — g£ Pk- 



2 Specialization to quadratic functions 

In this section we present some results on the specialization of Algorithm 1 to convex 
quadratic functions with exact line search. In particular, we prove finite termination 
of the algorithm. Finite termination is also a consequence of the equivalence to linear 
conjugate gradient (discussed in the next section), but the proof presented here is a short 
self-contained proof that follows different lines from customary proofs of finite termination. 
The difference arises from the fact that Algorithm 1 is a one-step method (i.e., it does not 
involve recurrences), and therefore its analysis does not require an induction hypothesis 
that spans the iterations as in the customary analysis. 

Suppose that /(w) = /o( w ) — w T A w/2 — bp w, where A e R, nxn i s symmetric and 
positive definite. Then it follows from step [7] of Algorithm 1 that /fc(w) = w T Akw/2 — 
b^w, where Ak = 1% ■ ■ ■ IfAoh ■ ■ - Ik and b^ = Ij. ■ ■ ■ lj b . 

In the case of quadratic functions, the optimal choice of ak in step [3] of Algorithm 1 
is well known to be (see [1]) 

«fc = -r-j . (3) 

Pfc Ak-!P k 

We can develop the following further relationships. Combining ([I]) and (T5J) yields 

Pfc+i = ^l(gfe) 

V PkPkJ 

= gfc- 

Also, 

g/t = Pfe+i = -V/fc_i(wjfe) 

= -A k -!W k + bk-t 
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= -A fc _iw fc + b fc _i - a k A k -ip k 
= Pfc - "fc^fc-iPfc- 



With these relationships in hand, we can now propose the main result that implies to 
finite termination. 

Let us introduce the following notation: 

K(A k -i, Pfe) = span(p fc , A fc _ip fc , Al_iPk, ■■■), 
i.e., the minimal invariant subspace of A k ^\ that contains pfc. 

Theorem 1. The invariant subspace K(A k , Pfc+i) is a proper subspace of K(Ak-i, Pfe)- 

Proof. First, observe that p fc+1 (= gfc) G p fe ), which follows from equality demon- 

strated above that p fc+1 = p fc — A k _ip k . Next, we claim more generally that K(A k , p&+i) C 
^(Ajfc-i, pfc). This follows because Afc = l k A k -±l k . The three operators 4, l k and Afc_i all 
map K(A k -i, p k ) into itself since gfc and pfc are both already proven to lie in this space. 
Thus, A k maps K(A k -i,p k ) into itself. 

Thus, we have shown K(A k ,p k+1 ) C K(A k _i,p k ). To conclude the proof, we must 
show that it is a proper subspace. We claim that K(A k , p k+ i) C p^. Observe first that 
Pfc+i £ P^; this follows immediately from (j2j). Next, it is obvious from the definition of 
Zfc that pfc is a right eigenvector of l k . Furthermore, pfc is a right eigenvector of l^Ak-i, 
as we see from the following algebra: 

l T k A k ^p k = (i+^fi) Ak-iPk 
\ PkPkJ 

A j_ Pk A k-lPk 

— ^fc-lPfc + gfc t 

Pfe Pfc 

= Afc-iPfc + gfe/«fe 

= ^4fc-iPfc + (Pfe - ttfc^4fc-iPfc)/afc 

= Pfc/Ofc. 

The statement under consideration K(A k , p k+ i) C p^ can be rewritten as the equation 
p^yl fc pfc +1 = for all i, i.e., p^ (l k A k _il k ) % g k = 0. But (l k A k _il k ) z can be factored as 
products of l k and A k _il kl and we have just proved that p k is a left eigenvector of both 
of these operators. Therefore, pl(l k A k ^il k ) l g k = scalar ■ pj[ gfc = 0. 
Therefore, we have proved that 

K(A k , P fc +1 ) c K(A k -!, Pfe) n p^ 

Thus, to show that K(A k , p k +i) is a proper subset of K(A k -i, p k ), it suffices to show that 
K(A k -i,Pk) is not a subspace of p k . But this is obvious, since the former contains pfe 
while the latter does not. □ 

This theorem proves finite termination of Algorithm 1: the dimension of the invariant 
subspace at iteration is at most n, and the dimension shrinks by at least 1 each iteration, 
so therefore the algorithm terminates in at most n iterations. 
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More strongly, if the coefficient matrix Aq has at most s distinct eigenvalues, then 
Algorithm 1 terminates in at most s iterations, since any vector lies in an invariant 
subspace of dimension at most s for such a matrix. 

Finally, the above theorem suggests that Algorithm 1 converges superlinearly. We 
recall the following two facts (see PQ): the steepest descent algorithm applied to a convex 
quadratic function converges at a rate proportional to the condition number of the matrix. 
Furthermore, the condition number of a matrix acting on a subspace can never exceed 
(and is usually less than) the condition number of the matrix acting on the whole space, 
a consequence of the Courant-Fisher minimax theorem. Thus, we see that Algorithm 1 
consists of steepest descent in ever smaller invariant subspaces, so the effective condition 
number of the matrix decreases (or at least, does not increase) each iteration and hence 
the convergence rate is expected to be superlinear. 



3 Equivalence to linear conjugate gradient 

Again, we assume for this section that /(x) = x T Ax/2 — b T x, where A G R" xn is 
symmetric and positive definite. We assume again that the line search is exact. We prove 
that Algorithm 2 is equivalent to linear conjugate gradient. For the sake of completeness, 
let us write linear conjugate gradient in its usual form as follows. Let xo be given. 

Algorithm Linear-CG 

[I] r := b - Ax ; 
[2] for k = 1, 2, . . . 
[3] if k = 1 

[4] ii! = r ; 

[5] else 

[6] fa = rLi r fe-i/ ( r L 2 r fc-2); 

[7] n k = /3 fc n fc _i + r fc _i; 

[8] end 

[9] «fc = liVfe-i/ (nfAn fc ); 

[10] x fc = x fc _i + a k n k ; 

[II] r fc = r fe _! - a k An k ; 
[12] end 

Well known properties of Linear-CG are that = b — A~x k = — V/ (x&) and that the r k s 
are mutually orthogonal (see [T]). We claim that Algorithm 2 and Algorithm Linear-CG 
are equivalent with the following relationships among the variables: = r^-i; = n k ; 
gfc = rfc, and a k is the same between the algorithms. This equivalence is proved by 
induction. For the k — 1 case, it is clear that pi = mi = ni = r and gi = ri. For k > 1, 
we see that 

Pk = -ll-l ■ • ■ ^V/(x fe _!) 

p*-iP*-i/ V Pi pi/ 
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= r fc _i. 

The second and third line both involved application of the induction hyptohesis, and the 
last line follows because all terms drop out from the product with r^-i except the identity 
because the r^s are mutually orthogonal. 

Next, we show by induction that = m^. Observe from step [7] of Linear-CG that 
n fc - /3 fc n fc _i = r fc _! while 

a ( r k-i r k-i\ 
m fc - (JkiUk-t = m k - I — m fc _i 




rfc-2 



In the above derivation, we applied the induction hypothesis, the definition of h-i, and, 
for the last line, again the fact that the r^'s are mutually orthogonal. This equation proves 
that rifc — Pk^-k-i = —ftkHik-u hence the sequence of m^'s and n^'s are equal. Finally, 
we must claim that = r^. Again, this follows from step [6] of Algorithm 2 and the 
orthogonality of the r^'s. 



4 The secant condition 

In this section we drop the assumption that / is quadratic but continue to assume that 
it is C 1 . We prove that if the line search is exact, then Algorithm 2 satisfies the secant 
condition, which states 

Hk+iYk = mfc 

where H k+ i = li o ■ • • o l k o l k o ■ ■ ■ o t[ 7 that is, the operator that carries — V/(x fc ) to 
nijfc+i, and y^+i = V/(x fc+1 ) — V/(x fc ). The secant condition is usually stated as the 
requirement that H k+ i~y k = akm k [4J. The scaling factor, however, is inconsequential 
because the algorithm can be equivalently presented with a different scaling of Hk] that 
scaling would be canceled in the line search, which would carry out the reciprocal scaling. 

It should be noted that the best known secant algorithms including DFP and BFGS 
satisfy the secant condition regardless of whether the line search is exact, so Algorithm 2 
differs from these algorithms in this respect. 

Checking the secant condition is fairly straightforward algebra as follows. It follows 
from steps [2] and [6] that l^o- ■ -o/f (V/(x fc+ i)) = -g fc and l k _ x o- ■ -o/f (V/(x fc )) = -p*;, 
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hence 



C-i o • • • o Zf(V/(x fe+1 ) 
Next, applying 1% yields: 

K_i ° • " " o ^(V/(x, +1 ) - V/Cxfc)) 



V/(x fc )) = p fc -g fc . 



? (P* - gfc) 




g, 



= Pfc - gfc + 



gfcPfcPfc 
pfPfc 




= Pfc, 



where, to obtain the last line, we invoked ([2]) since the line search is exact. Next, since 
Pfc is an eigenvector of Ik with eigenvalue 1 (again using the fact that the line search is 
exact so g^Pfc = 0), 



Finally, applying li ■ ■ ■ l^-i to both sides and applying statement [3] yields the desired 
result. 

5 Computational results (preliminary) 

In this section we compare Algorithm 2 to BFGS, DFP, Polak-Ribiere conjugate gradient 
(CG-PR+), and Fletcher- Reeves conjugate gradient (CG-FR). Refer to [4] for informa- 
tion about all of these algorithms. In this section we denote Algorithm 2 as SDICOV for 
"steepest descent with iterated change of variables." We report only the number of iter- 
ations. The BFGS and DFP algorithms are implemented using product form rather than 
explicit formation of H^. This means that, like SDICOV, the number of operations and 
storage requirement for the fcth iteration is O(kn) plus a function and gradient evaluation 
(plus additional function and gradient evaluations in the line search). In contrast, CG- 
PR+ and CG-FR require only 0(n) storage and 0(n) operations per iteration. Therefore, 
the iteration counts reported here partially hide the greater efficiency of CG-PR+ and 



The first test is the nonconvex distance geometry problem [2] , a nonlinear least squares 
problem. There are n particles in R 2 whose positions are unknown. One is given the 
interparticle distances for some subset of possible pairs of particles. The problem is to 
recover the coordinates from these distances. Thus, the unknowns are X3, . . . , x n , positions 
of particles 3 to n, each a vector in R 2 . To remove degenerate degrees of freedom, we 
assume the positions of particles 1 and 2 are fixed. The objective function is 



where E denotes the subset of {1, ... , n} 2 of pairs whose distance is given and d^ denotes 
the given distance. This problem has multiple local minima (indeed, global minimization 



hlllk-i ■ • ■ #(V/(x fc+1 ) - V/(x fc )) = pfc. 



CG-FR. 



/(x 3 
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Table 1: Results of distance geometry trials 



Algorithm 


Ave. no. iterations 






^■particle 10 ^particle 


100 


SDICOV 


34 


76 


BFGS 


20 


75 


DFP 


24 


80 


CG-PR+ 


93 


107 


CG-FR 


146 


161 



of this function is known to be NP-hard), so the testing procedure must account for the 
possibility that that different algorithms could converge to different minimizers, which 
could skew iteration counts. To avoid this possibility, we constructed instances with a 
known global minimizer (by first selecting the positions randomly, and then computing 
the interpair distances from those positions). Then we initialized the algorithm fairly close 
to the global minimizer so that all algorithms would fall into the same basin. The line 
search procedure is inexact: it uses bisection with a termination criterion that |0'(a)| < 
O.2|0'(O)|. The convergence tolerance is a relative reduction in the norm of the gradient of 
10~ 5 . Two sizes were tried, namely 10 particles (n = 16) and 100 particles (n = 196). For 
each problem size, four trials were run, and the number of iterations over the trials was 
averaged. The results are summarized in Table [TJ For the smaller problem SDICOV was 
worse than BFGS or DFP, but for the larger problem, the three algorithms have similar 
performance. The two versions of conjugate gradient are slower. 

The second test is a larger class of problems, namely, a finite element mesh improve- 
ment problem. Given a subdivision of a region Q C R 3 into tetrahedra, the problem 
under consideration is to displace the nodes of the tetrahedra in such a way as to improve 
the overall quality of the mesh. There are several measures of quality; we use the ratio 
of the volume of the tetrahedra to the cube of one of its side lengths. The minimum 
such ratio over all tetrahedra is a measure of the mesh quality (the closer to 0, the worse 
the mesh). The details of our method are in j5]. Briefly, we smooth this nonsmooth 
unconstrained problem (nonsmooth because it is maximization of a minimum) by intro- 
ducing an auxiliary variable standing for the minimum ratio and constraints to enforce 
its minimality. The smoothed constrained problem is the solved with a barrier function 
approach. Ultimately, the problem reduces again to an unconstrained problem, except 
the objective function is a smoothed version of the original that involves the logarithms 
of the ratios. 

There is a second source of nondifferentiability that remains in the problem due to 
parametrization of the boundary. For interior nodes in the mesh, the variables in the 
optimization problem are its (x, y, z) coordinates. For nodes on the boundary, however, 
the variables are the (u, v) or t parametric coordinates of the boundary surface. We wish 
to allow nodes on the boundary to move from one parametric patch of a boundary surface 
to another; such movement introduces a nondifferentiable jump in the objective function. 
If the boundary surfaces are smooth, it would be possible in principle to come up with 
smooth local parametrizations that would circumvent this difficulty, but we have not done 
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Table 2: Results of the optimization algorithms on the mesh improvement problem. A 
missing entry indicates failure of the iteration. 





Cylinder 


No. of iterations 

Large cavity Small cavity 


n 


7380 


8775 4254 


SDICOV 


22 


87 373 


BFGS 






DFP 






CG-PR+ 


34 




CG-FR 


81 





so. 

The line search is again based on bisection and enforces the inequality < 
O.7|0'(O)|. It needs a safeguard, since a step too large can invert a tetrahedron, thus 
sending the above ratio to a negative number and hence making the logarithm undefined. 
The initial point for the optimization routine is the mesh produced by the QMG mesh 
generator [3J. 

We tested three problems: a mesh of a cylinder, of a cube with a large spherical 
cavity, and of a tetrahedron with a small octahedral cavity. For this third problem, 
each boundary surface is a single flat parametric patch, so the problem is differentiable 
because there are no parametric jumps. The results of this test are shown in Table [2j This 
problem is again nonconvex and probably has many local minima. In this test case, we 
did not have a means to ensure that the different algorithms find the same minimizer. The 
algorithms, however, returned solutions with comparable objective function values (when 
they succeeded). BFGS and DFP failed in every case in the sense that they terminated 
due to stagnation prior to satisfaction of the convergence termination criterion. Our test 
for stagnation was four successive iterations without significant reduction in either the 
function value or gradient norm. Prior to stagnation, there was generally slow progress in 
these algorithms; for example, the stagnation test required 127 iterations for BFGS and 
126 for DFP in the cylinder case to activate. The two conjugate gradients also sometimes 
failed due to stagnation; CG-PR+ also failed once for producing a search direction that 
was not a descent direction. 

It must be pointed out that we have not implemented a restart strategy for either 
BFGS or DFP. Most modern implementations would have such a strategy, and this would 
presumably ameliorate the difficulty with slow progress. 

6 Concluding remarks 

We propose a new iterative method for unconstrained minimization. The algorithm is 
based on steepest descent after a linear change of coordinates. It is a secant method if 
the line search is exact. It always produces a descent direction except in the case that the 
line search produces a certain degenerate result. 
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In practice, the new method works well on two test cases. Our preliminary results hint 
that, if restarting is not used, the new algorithm in practice is sometimes more robust 
than BFGS and DFP. 
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